Soft-Error Tolerant Cache Architectures
Authors
Abstract
The problem of soft errors caused by radiation events is expected to get worse with technology scaling. This thesis focuses on mitigating soft errors to improve the reliability of memory caches. We survey existing mitigation techniques and discuss their issues. We then propose 1) a technique that mitigates soft errors in caches at lower cost than the widely used Error Correcting Code (ECC), 2) a technique to mitigate soft errors in Content-Addressable Memories, and 3) a cost-effective cache architecture that tolerates both variation-induced defects and soft errors.

ECC is widely used to detect and correct soft errors in memory caches. Maintaining ECC on a per-word basis, which is preferred for caches with word-based access, is expensive. Chapter 3 proposes Zigzag-HVP, a cost-effective technique to detect and correct soft errors for such caches. Zigzag-HVP utilizes horizontal-vertical parity (HVP). Basic HVP can detect and correct a single-bit error (SBE), but not a multi-bit error (MBE). By dividing the data array into multiple HVP domains and interleaving different domains, a spatial MBE can be converted to multiple SBEs, each of which can be detected and corrected by the corresponding parity domain. Vertical parity update and error recovery in Zigzag-HVP can be performed efficiently through modifications to the cache data paths, write buffer, and Built-In Self Test. Evaluation results indicate that the area and power overheads of Zigzag-HVP caches are lower than those of ECC-based ones.

Chapter 4 proposes STCAM, a soft-error tolerant Content-Addressable Memory (CAM). Soft-error mitigation in a CAM is difficult because the stored data is not available outside the cell array during a CAM access. Since CAMs are used in several components of a processor, making those CAMs resilient against soft errors is required to attain high processor reliability. STCAM can successfully detect and correct false hits and false misses caused by soft errors in a CAM. This is achieved through subdividing a CAM and
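The interleaved horizontal-vertical parity idea described for Zigzag-HVP can be illustrated with a small Python sketch. It is not the thesis's implementation: the diagonal domain mapping `(r + c) % num_domains`, the per-domain row and column parity layout, and the names `domain_of`, `compute_parities`, and `correct` are all assumptions chosen for illustration. The sketch only shows why a burst of adjacent flipped bits becomes one correctable single-bit error per parity domain.

```python
from itertools import product

def domain_of(r, c, num_domains):
    # Hypothetical diagonal interleaving (an assumption, not the thesis layout):
    # neighbours along a row or a column fall into different domains.
    return (r + c) % num_domains

def compute_parities(data, num_domains):
    """Return (hpar, vpar); hpar[d][r] and vpar[d][c] XOR the cells of domain d."""
    rows, cols = len(data), len(data[0])
    hpar = [[0] * rows for _ in range(num_domains)]
    vpar = [[0] * cols for _ in range(num_domains)]
    for r, c in product(range(rows), range(cols)):
        d = domain_of(r, c, num_domains)
        hpar[d][r] ^= data[r][c]
        vpar[d][c] ^= data[r][c]
    return hpar, vpar

def correct(data, hpar, vpar, num_domains):
    """Repair at most one flipped bit per domain in place; return fixed cells."""
    new_h, new_v = compute_parities(data, num_domains)
    fixed = []
    for d in range(num_domains):
        bad_rows = [r for r, (a, b) in enumerate(zip(hpar[d], new_h[d])) if a != b]
        bad_cols = [c for c, (a, b) in enumerate(zip(vpar[d], new_v[d])) if a != b]
        if not bad_rows and not bad_cols:
            continue  # this domain saw no error
        if len(bad_rows) != 1 or len(bad_cols) != 1:
            raise ValueError(f"domain {d}: more than one error, uncorrectable")
        r, c = bad_rows[0], bad_cols[0]
        data[r][c] ^= 1  # the failing row/column parities intersect at the bad bit
        fixed.append((r, c))
    return fixed

# Demo: a 3-bit burst in one row of a 4x8 array, protected by 4 domains.
data = [[(r * 3 + c) % 2 for c in range(8)] for r in range(4)]
golden = [row[:] for row in data]
hpar, vpar = compute_parities(data, 4)
for c in (2, 3, 4):
    data[1][c] ^= 1  # inject the spatial multi-bit error
fixed = correct(data, hpar, vpar, 4)
print(sorted(fixed), data == golden)
```

With a single domain the same burst trips three column parities at once and the sketch reports it as uncorrectable, which mirrors the limitation of basic HVP noted above. The diagonal mapping used here only separates bursts that run along one row or one column; two-dimensional clusters would need a different mapping.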
Related resources
Partially Protected Caches to Reduce Failures due to Soft Errors in Mission-Critical Multimedia Systems
With advances in process technology, soft errors are becoming an increasingly critical design concern. Soft errors are manifested as a toggle in Boolean logic, which may result in failure of the system functionality. Owing to their large area, high density, and low operating voltages, caches are worst hit by soft errors. Although Error Correction Code (ECC) based mechanisms have been suggested ...
Partially Protected Caches to Reduce Failures due to Soft Errors in Multimedia Applications
With advances in process technology, soft errors are becoming an increasingly critical design concern. Owing to their large area, high density, and low operating voltages, caches are worst hit by soft errors. Based on the observation that in multimedia applications, not all data require the same amount of protection from soft errors, we propose a Partially Protected Cache (PPC) architecture, in...
Probabilistic Soft Error Detection Based on Anomaly Speculation
Microprocessors are becoming increasingly vulnerable to soft errors due to the current trends of semiconductor technology scaling. Traditional redundant multithreading architectures provide perfect fault tolerance by re-executing all the computations. However, such a full re-execution technique significantly increases the verification workload on the processor resources, resulting in severe per...
Resilient On-Chip Memory Design in the Nano Era
Abstract of the dissertation "Resilient On-Chip Memory Design in the Nano Era" by Abbas BanaiyanMofrad, Doctor of Philosophy in Computer Science, University of California, Irvine, 2015. Professor Nikil Dutt, Chair. Aggressive technology scaling in the nano-scale regime makes chips more susceptible to failures. This causes multiple reliability challenges in the design of modern chips, including manufacturing d...
Soft Coherence: Preliminary Experiments with Error-Tolerant Cache Coherence in Numerical Applications
As we scale into the multi-core era, we face severe challenges in the scalability and performance of on-chip cache-coherent shared memory mechanisms. We explore application error-tolerance as an extra degree of freedom to meet these challenges. Iterative numerical algorithms, in particular, can cope with the occasional stale value with little or no effect on accuracy or convergence time. We exp...